Welcome to our advanced R project working on a marketing campaign data set focussing on term deposit offers! You can find this project also under the following GitHup Repository. This project focuses on the classification problem that a Portugese bank was facing, which can be found under following link. The bank wanted to know which customers are more likely to accept a term deposit offer in a telemarketing campaign. Thus a predicitive model to answer this question is created in order to optimize the bank’s marketing resources. The general idea behind this is that it is preferrable to call customers that are more likely to accept the offer because each call means marketing cost and with such a predicitive model not only cost can be controlled better but also the actual return on the marketing campaign can be improved.

Our worked is divided into two parts: Data exploration and modelling (including feature engineering). Thus two markdowns are provided to give a better overview.

But before we go into the project itself, lets introduce our team tackled this problem by dividing the machine learning process into three different workstreams:

  1. Data exploration and visualization: Christine, Victor, Celine – 2. Data engineering: Tomas, Theo – 3. Modelling - Dan, Andrew
O1 R Masters :)

O1 R Masters :)

Source loading

Before we start with the first part of our project, which is data exploration, we load the required R libraries from a separate R script called “libraries”.

source('notebooks/libraries.R')

1. Data set

We were provided with two data sets. The first is a train data set, which includes the actual outcome of the marketing activity for each contacted customer (labeled train data set). The second is a “blind” test data set that does not include the actual outcome of the campaign. At the end of our project we will use our best model to make predictions on this test data set. One can imagine this like putting the model into production a feeding it with new data in order to predict which customers will accept the offer and which will not. The overall data set contains 45,211 rows and the original data set can be found in the UCI Machine Learning Repository. It is important to note that the data sets we were provided with are not 100% the same compared to the repository. Some of the features from the repository are not included here while other features such as balance, educatio, day are added or transformed. However we know based on the information on the data set provided that the data covers activities from from May 2008 to November 2010. Comparing the size of the train and the blind test set we see that the over all data set was split into 80% train and 20% test.

The provided data sets contain general as well as banking specific information about each contacted client. Such data can usually be found in the CRM system of a bank. The bank intends to use this data in order to identify patterns that a related to the likeliness of a customer responding positive to the offered marketing campaign. It is important to notice that this campaign using phone calls is about a term deposit offer. A term deposit is a specific banking product that is a fixed-term investment that includes the deposit of money into an account at a financial institution. Term deposit investments usually carry short-term maturities ranging from one month to a few years and will have varying levels of required minimum deposits. Thus the developed predicted model should only be used for marketing campaigns around term deposits since the same customer can respond completly different when he is offered a different banking product such as a over draft protection or a loan.

Now that we know what the data set is about in general let us take a closer look at it and see what it actually contains in terms of features.

1.1 Data loading

Both data sets include the same 16 features while the train set contains an additional column storing the outcome of the marketing campaign, which represents our target variable. Since the target variable is either yes (term deposit offer accepted) ot no (term deposit offer not accepted), we are facing a binary classification problem.

The next step is to take a closer look at the provided features.

Feature overview

  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
  3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’)
  4. education (categorical: ‘primary’,‘secondary’,‘tertiary’,‘unknown’)
  5. default: has credit in default? (categorical: ‘no’,‘yes’)
  6. balance: current account balance (numeric)
  7. housing: has housing loan? (categorical: ‘no’,‘yes’)
  8. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
Other attributes:
  1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)
  3. previous: number of contacts performed before this campaign and for this client (numeric)
  4. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘other’,‘success’, ‘unknown’)
Output variable (desired target):
  1. y: has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
raw_train_data<-fread('Data/BankCamp_train.csv', stringsAsFactors = F)
raw_test_data<-fread('Data/BankCamp_test.csv', stringsAsFactors = F)

str(raw_train_data)
## Classes 'data.table' and 'data.frame':   36168 obs. of  17 variables:
##  $ age      : int  50 47 56 36 41 32 26 60 39 55 ...
##  $ job      : chr  "entrepreneur" "technician" "housemaid" "blue-collar" ...
##  $ marital  : chr  "married" "married" "married" "married" ...
##  $ education: chr  "primary" "secondary" "primary" "primary" ...
##  $ default  : chr  "yes" "no" "no" "no" ...
##  $ balance  : int  537 -938 605 4608 362 0 782 193 2140 873 ...
##  $ housing  : chr  "yes" "yes" "no" "yes" ...
##  $ loan     : chr  "no" "no" "no" "no" ...
##  $ contact  : chr  "unknown" "unknown" "cellular" "cellular" ...
##  $ day      : int  20 28 19 14 12 4 29 12 16 3 ...
##  $ month    : chr  "jun" "may" "aug" "may" ...
##  $ duration : int  11 176 207 284 217 233 297 89 539 131 ...
##  $ campaign : int  15 2 6 7 3 3 1 2 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 276 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 2 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(raw_test_data)
## Classes 'data.table' and 'data.frame':   9043 obs. of  16 variables:
##  $ age      : int  58 43 51 56 32 54 58 54 32 38 ...
##  $ job      : chr  "management" "technician" "retired" "management" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "primary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 593 229 779 23 529 -364 1291 0 424 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "no" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 55 353 164 160 1492 355 266 179 104 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Classifying the variables between discrete and continuous variables

For our following data exploration we are creating two different vectors, one that stores all the features that are categorical and one with the numeric/continuous features.

discrete_var<-c("job", "marital", "education", "default", "housing", "loan","contact", "month", "poutcome")
  
continuous_var<-c("age", "balance","day", "duration", "campaign", "pdays", "previous")

2. Target distribution

First of all we take a look how our target is distributed. This means looking at how many of the contacted customers accepted the offer compared to how many did not. We see here that most of the customers did not accept the offer (88.4%). Only 11.6% of the customers that were contacted actually accepted the offer. If we assume that the customers were contacted randomly without any preselection, we can say that only one out of nine customers accept the offer.

This insight is very important for our later machine learning model. Because if we would predict all of the customers are a ‘no’ we would end up with an accuracy of almost 90%. Also if ‘no’ is defined as the positive class and we predict all of the rows as ‘no’ we end up with very high sensitivity results (True positives/[True positives+False negatives]). However, the idea behind this model is to identify customers that are more likely to accept our offer and thus we need not only to define ‘yes’ as the positive class but also use a sampling method to train our model due to the high class imbalance.

target_dist<-ggplot(raw_train_data, aes(y, fill=y))+
              geom_bar() + scale_fill_manual(values=c("#995052", "#529950")) +
              xlab("Campaign offer accepted")+
              geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
              position=position_dodge(1), size=3, vjust = -.5)+
              ggtitle("Distribution of target variable") + theme_minimal() + 
              theme(text = element_text(face = "bold"), legend.position = "none",
              panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
              plot.title = element_text(hjust = 0.5))
target_dist

3. Distribution of the continous variables

The next step is to look at the distribution of our features. This helps us to identify if some variables are skewed or unbalanced. We will start in this case with continous variables and look at the quintiles (including min and max), the mean and the standard deviation. This helps us especially when looking at the minimum to identify special cases such as pdays -1 compared to only looking at the histograms.

In the case of pdays we see we have more than 75% of the rows that contain a -1 which means the customer was never contacted before. This is also reflected in the feature previous that shows a 0 for most customers, counting the contacts performed before this campaign.

cont_var <- as.data.frame(raw_train_data)
cont_var <- cont_var[ , (names(cont_var) %in% continuous_var)]
distribution <- as.data.frame(t(sapply(cont_var, quantile)))
distribution$Mean <- sapply(cont_var, mean)
distribution$SD <- sapply(cont_var, sd)
datatable(round(distribution, 2))

3.1 Overview of the distribution of all the continuous variables

Looking at the density plots of out continous variables we can see that all of them are highly skewed to the right. The exception here is the day which reflects the day within the month. However, we see a variation across days and not a uniform distribution meaning we have days within the month were more customer are called compared to others.

cont_var_melt <- as.data.frame(melt(cont_var))

cont_dist <- ggplot(cont_var_melt, aes(value)) +
        geom_density(aes(fill = variable)) + 
        facet_wrap(~variable, scales = "free", nrow = 3) +
        labs(x = "", y = "", fill = "") + theme_minimal() +
        scale_fill_tableau() + ggtitle("Distribution of each continous variable") +
        theme(text = element_text(face = "bold"), legend.position = "none",
              panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
             plot.title = element_text(hjust = 0.5))
cont_dist

3.2 Distribution of each continuous variable

The next

## Create a new column in the test dataset 
raw_test_data$y <- NA

## Creating a column "dataType" for both train and test datasets and assign the value 'train' & 'test'
raw_train_data$dataType <- "train"
raw_test_data$dataType <- "test"


## Merging both train and test datasets 
dataset <- rbind(raw_train_data, raw_test_data)

Comparing train and test distirbution of continuous features

Age

ggplot(dataset, aes(x=age, color = dataType)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Age Distribution") + theme_classic() +
        scale_color_manual(values=c("#e08926", "#3526e0"))

Balance

ggplot(dataset, aes(x=balance, color = dataType)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Balance Distribution") + theme_classic() +
        scale_color_manual(values=c("#e08926", "#3526e0"))

Day

ggplot(dataset, aes(x=day, color = dataType)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Day Distribution") + theme_classic() +
        scale_color_manual(values=c("#e08926", "#3526e0"))

Duration

ggplot(dataset, aes(x=duration, color = dataType)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Duration Distribution") + theme_classic() +
        scale_color_manual(values=c("#e08926", "#3526e0"))

Campaign

ggplot(dataset, aes(x=campaign, color = dataType)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Campaign Distribution") + theme_classic() +
        scale_color_manual(values=c("#e08926", "#3526e0"))

Pdays

ggplot(dataset, aes(x=pdays, color = dataType)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("pdays Distribution") + theme_classic() +
        scale_color_manual(values=c("#e08926", "#3526e0"))

4. Distribution of the discrete variables

4.1 Overview of the distribution of the discrete variables

df_disc <- raw_train_data[, ..discrete_var]
df_disc <- sapply(df_disc, as.factor)
df_disc <- as.data.frame(melt(df_disc))

disc_dist <- ggplot(df_disc, aes(value)) +
      geom_bar(aes(fill = Var2)) + 
      scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
      scale_x_discrete(expand = c(0,0)) +
      facet_wrap(~Var2, scales = "free", nrow = 2) +
      scale_fill_tableau() +
      ggtitle("Count of each discrete variable") +
      labs(fill = "", x = "", y = "") +
      theme_minimal() +
      theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

disc_dist

4.2 Distribution of each discrete variable

Comparing train and test distirbution of discrete features

Job

job_train <- ggplot(raw_train_data, aes(x=job)) +
        geom_bar(fill ="#3526e0") +
        ggtitle("Job Distribution - Train") + labs(y = "", x="") + theme_minimal() + 
        theme(text = element_text(face = "plain"), legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

job_test <-  ggplot(raw_test_data, aes(x=job)) +
        geom_bar(fill = "#e08926") + ggtitle("Job Distribution - Test") + labs(y = "", x="") +         theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",  
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(job_train, job_test, ncol=2)

Marital

marital_train <- ggplot(raw_train_data, aes(x=marital)) +
        geom_bar(fill ="#3526e0") + ggtitle("Marital Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

marital_test <-  ggplot(raw_test_data, aes(x=marital)) +
        geom_bar(fill = "#e08926") + ggtitle("Marital Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(marital_train, marital_test, ncol=2)

Education

education_train <- ggplot(raw_train_data, aes(x=education)) +
        geom_bar(fill ="#3526e0") + ggtitle("Education Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

education_test <-  ggplot(raw_test_data, aes(x=education)) +
        geom_bar(fill = "#e08926") + ggtitle("Education Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(education_train, education_test, ncol=2)

Default

default_train <- ggplot(raw_train_data, aes(x=default)) +
        geom_bar(fill ="#3526e0") + ggtitle("Default Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() +    theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

default_test <-  ggplot(raw_test_data, aes(x=default)) +
        geom_bar(fill = "#e08926") + ggtitle("Default Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(default_train, default_test, ncol=2)

Housing

housing_train <- ggplot(raw_train_data, aes(x=housing)) +
        geom_bar(fill ="#3526e0") + ggtitle("Housing Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() +    theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

housing_test <-  ggplot(raw_test_data, aes(x=housing)) +
        geom_bar(fill = "#e08926") + ggtitle("Housing Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(housing_train, housing_test, ncol=2)

Loan

loan_train <- ggplot(raw_train_data, aes(x=loan)) +
        geom_bar(fill ="#3526e0") + ggtitle("Loan Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

loan_test <-  ggplot(raw_test_data, aes(x=loan)) +
        geom_bar(fill = "#e08926") + ggtitle("Loan Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(loan_train, loan_test, ncol=2)

Contact

contact_train <- ggplot(raw_train_data, aes(x=contact)) +
        geom_bar(fill ="#3526e0") + ggtitle("Contact Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

contact_test <-  ggplot(raw_test_data, aes(x=contact)) +
        geom_bar(fill = "#e08926") + ggtitle("Contact Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(contact_train, contact_test, ncol=2)

Month

month_train <- ggplot(raw_train_data, aes(x=month)) +
        geom_bar(fill ="#3526e0") + ggtitle("Month Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

month_test <-  ggplot(raw_test_data, aes(x=month)) +
        geom_bar(fill = "#e08926") + ggtitle("Month Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(month_train, month_test, ncol=2)

Previous campaign

poutcome_train <- ggplot(raw_train_data, aes(x=poutcome)) +
        geom_bar(fill ="#3526e0") + ggtitle("poutcome Distribution - Train") + 
        labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

poutcome_test <-  ggplot(raw_test_data, aes(x=poutcome)) +
        geom_bar(fill = "#e08926") + ggtitle("poutcome Distribution - Test") + 
        labs(y = "", x="") + theme_minimal() +  theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(size = 7, angle = 90),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

grid.arrange(poutcome_train, poutcome_test, ncol=2)

5. Correlation plot

correlation <- cor(cont_var)

p <- plot_ly(
  x = continuous_var,
  y=continuous_var,
  z = correlation,
  type = "heatmap",
  colorscale=list(c(0, "rgb(0,0,255)"),
                  list(0.1, "rgb(51,153,255)"),
                  list(0.2, "rgb(102,204,255)"),
                  list(0.3, "rgb(153,204,255)"),
                  list(0.4, "rgb(204,204,255)"),
                  list(0.5, "rgb(255,255,255)"),
                  list(0.6, "rgb(255,204,255)"),
                  list(0.7, "rgb(255,153,255)"),
                  list(0.8, "rgb(255,102,204)"),
                  list(0.9, "rgb(255,102,102)"),
                  list(1, "rgb(255,0,0)")),
  title = "Correlation Matrix"
)

p <- layout(p, title="Continuous Variable Correlations")

p
## Warning: 'heatmap' objects don't have these attributes: 'title'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'z', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'transpose', 'xtype', 'ytype', 'zsmooth', 'connectgaps', 'xgap', 'ygap', 'zhoverformat', 'zauto', 'zmin', 'zmax', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'colorbar', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

6. Outlier Analysis

Outlier Analysis for continuous and discrete variables

Continuous Variables

cont_box <- ggplot(cont_var_melt, aes(variable, value)) +
          geom_boxplot(aes(fill = variable)) +
          coord_flip() +                                
          scale_fill_tableau() +
          labs(x = "", y = "") +
          theme_minimal() +
          theme(text = element_text(face = "bold"),
                legend.position = "none",
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
               plot.title = element_text(hjust = 0.5),
               axis.text.x = element_blank())                                
                            
                      
cont_box

Continuous Scaled Variables

df_cont_norm <- raw_train_data[,..continuous_var]
df_cont_norm <- as.data.frame(apply(df_cont_norm, 2,function(x)((x - min(x))/(max(x)-min(x)))))
df_cont_norm <- as.data.frame(melt(df_cont_norm))

cont_box_norm <- ggplot(df_cont_norm, aes(variable, value)) +
              geom_boxplot(aes(fill = variable)) +
              coord_flip() +                                
              scale_fill_tableau() +
              labs(x = "", y = "") +
              theme_minimal() +
              theme(text = element_text(face = "bold"),
                    legend.position = "none",
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                   plot.title = element_text(hjust = 0.5),
                   axis.text.x = element_blank())                                
                            
                      
cont_box_norm

Discrete Variables

disc_box <- ggplot(df_disc, aes(Var2, as.numeric(value))) +
                geom_boxplot(aes(fill = Var2)) +
                scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
                scale_x_discrete(expand = c(0,0)) +
                facet_wrap(~Var2, scales = "free", ncol = 1) +
                scale_fill_tableau() +
                ggtitle("Distribution of each discrete variable") +
                labs(fill = "", x = "", y = "") +
                coord_flip() +
                theme_light() +
                theme(text = element_text(face = "bold"),
                      legend.position = "none",
                      axis.text.x = element_blank(),
                      panel.grid.major = element_blank(),
                      panel.grid.minor = element_blank(),
                      plot.title = element_text(hjust = 0.5),
                      strip.background = element_blank(),
                      strip.text.x = element_blank())

disc_box

7. Target distribution across features

7.1 Target distribution across continuous features

Age

ggplot(raw_train_data, aes(x=age, color = y)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Age Target Distribution") + theme_classic() + 
        labs(color = "Offer accepted") +
        scale_color_manual(values=c("#995052", "#529950"))

Balance

ggplot(raw_train_data, aes(x=balance, color = y)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Balance Target Distribution") + theme_classic() +
        labs(color = "Offer accepted") +
        scale_color_manual(values=c("#995052", "#529950"))

Day

ggplot(raw_train_data, aes(x=day, color = y)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Day Target Distribution") + theme_classic() +
        labs(color = "Offer accepted") +
        scale_color_manual(values=c("#995052", "#529950"))

Duration

ggplot(raw_train_data, aes(x=duration, color = y)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Duration Target Distribution") + theme_classic() +
        labs(color = "Offer accepted") +
        scale_color_manual(values=c("#995052", "#529950"))

Campaign

ggplot(raw_train_data, aes(x=campaign, color = y)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Campaign Target Distribution") + theme_classic() +
        labs(color = "Offer accepted") +
        scale_color_manual(values=c("#995052", "#529950"))

pdays

ggplot(raw_train_data, aes(x=pdays, color = y)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("pdays Target Distribution") + theme_classic() +
        labs(color = "Offer accepted") +
        scale_color_manual(values=c("#995052", "#529950"))

previous

ggplot(raw_train_data, aes(x=previous, color = y)) +
        geom_density(alpha = 0.7, size=1.5) +
        ggtitle("Previous Target Distribution") + theme_classic() +
        labs(color = "Offer accepted") +
        scale_color_manual(values=c("#995052", "#529950"))

7.2 Target distribution across discrete features

Job

 ggplot(raw_train_data,mapping = aes(job,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Marital

ggplot(raw_train_data,mapping = aes(marital,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Education

ggplot(raw_train_data,mapping = aes(education,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Default

ggplot(raw_train_data,mapping = aes(default,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Housing

ggplot(raw_train_data,mapping = aes(housing,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Loan

ggplot(raw_train_data,mapping = aes(loan,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Contact

ggplot(raw_train_data,mapping = aes(contact,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Month

ggplot(raw_train_data,mapping = aes(month,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))

Previous campaign

ggplot(raw_train_data,mapping = aes(poutcome,fill=y))+
  geom_bar(col="black", position = position_dodge(1))+
  geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
            position=position_dodge(1), size=3, vjust = -.5)+
  theme_classic() +
  scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))